Neural Information Processing Systems

We thank all of the reviewers for their helpful reviews. We agree that our original objective function has no guarantee that the outcome is fair; (1) we changed the objective function so that a specified level of fairness is guaranteed; (2) we performed experiments. See the table below (row 1). See the table below (row 2). See the table (row 3) and the figure below. As suggested, we compared our approaches to Zhang et al., "Mitigating Unwanted Biases with Adversarial Learning", and we show that it does not outperform our methods. Finally, we will improve the visibility of Figure 1 in our paper.


The Disparate Benefits of Deep Ensembles

Schweighofer, Kajetan, Arnaiz-Rodriguez, Adrian, Hochreiter, Sepp, Oliver, Nuria

arXiv.org Artificial Intelligence

Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not yet well understood. Algorithmic fairness investigates how a model's performance varies across different groups, typically defined by protected attributes such as age, gender, or race. In this work, we investigate the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups, in what we refer to as a disparate benefits effect. We empirically investigate this effect with Deep Ensembles applied to popular facial analysis and medical imaging datasets where protected group attributes are given, and find that it occurs for multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify the per-group difference in the predictive diversity of ensemble members as the potential cause of the disparate benefits effect. Finally, we evaluate different approaches to reduce the unfairness due to the disparate benefits effect. Our findings show that post-processing is an effective method to mitigate this unfairness while preserving the improved performance of Deep Ensembles.
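The two group fairness metrics named in this abstract can be computed directly from model predictions. The sketch below (plain NumPy on synthetic data with a hypothetical binary protected attribute; not the paper's models, datasets, or code) illustrates how averaging ensemble members can improve accuracy by a different amount for each group — a disparate benefits effect in miniature:

```python
import numpy as np

def statistical_parity_diff(y_pred, group):
    """Difference in positive-prediction rate between groups 0 and 1."""
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_diff(y_true, y_pred, group):
    """Difference in true-positive rate between groups 0 and 1."""
    tpr0 = y_pred[(group == 0) & (y_true == 1)].mean()
    tpr1 = y_pred[(group == 1) & (y_true == 1)].mean()
    return tpr0 - tpr1

# Toy setup: member models are noisier (more diverse) on group 1.
rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n)        # hypothetical protected attribute
y_true = rng.integers(0, 2, n)       # ground-truth binary labels
noise_scale = np.where(group == 0, 0.2, 0.5)
members = [y_true + rng.normal(0.0, noise_scale) for _ in range(5)]

single_pred = (members[0] > 0.5).astype(int)             # one member
ensemble_pred = (np.mean(members, axis=0) > 0.5).astype(int)  # averaged

def group_accuracy(pred, g):
    return (pred == y_true)[group == g].mean()

gain_g0 = group_accuracy(ensemble_pred, 0) - group_accuracy(single_pred, 0)
gain_g1 = group_accuracy(ensemble_pred, 1) - group_accuracy(single_pred, 1)
print(f"ensemble accuracy gain: group 0 {gain_g0:+.3f}, group 1 {gain_g1:+.3f}")
print(f"SPD (ensemble): {statistical_parity_diff(ensemble_pred, group):+.3f}")
print(f"EOD (ensemble): {equal_opportunity_diff(y_true, ensemble_pred, group):+.3f}")
```

In this toy construction the ensemble's gain is larger for the group whose members are more diverse, echoing the abstract's point that per-group differences in predictive diversity drive the effect.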


Beyond Performance: Quantifying and Mitigating Label Bias in LLMs

Reif, Yuval, Schwartz, Roy

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown remarkable adaptability to diverse tasks by leveraging context prompts containing instructions or minimal input-output examples. However, recent work has revealed that they also exhibit label bias -- an undesirable preference for predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, and highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches at both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.
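As a rough illustration of what a label-bias correction can look like, the sketch below applies a generic contextual-calibration-style adjustment: divide out the prior probability the model assigns to each label on a content-free input, then renormalize. This is not the calibration method proposed in the paper, and all probabilities here are made up:

```python
import numpy as np

def calibrate(probs, content_free_probs):
    """Divide each label's probability by the prior the model assigns to
    that label on a content-free prompt, then renormalize per example.
    (A generic sketch, not the paper's proposed calibration method.)"""
    scores = probs / content_free_probs
    return scores / scores.sum(axis=-1, keepdims=True)

# Hypothetical outputs for a binary task: the model prefers label 0.
prior = np.array([0.7, 0.3])           # P(label | content-free prompt)
probs = np.array([[0.60, 0.40],        # biased toward label 0
                  [0.75, 0.25],
                  [0.55, 0.45]])
cal = calibrate(probs, prior)
print("raw predictions:       ", probs.argmax(axis=-1))
print("calibrated predictions:", cal.argmax(axis=-1))
```

Before calibration every example is pulled toward label 0; after dividing out the prior, the first and third examples flip to label 1, while the second stays at label 0.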


Implementation and Evaluation of a System for Assessment of The Quality of Long-Term Management of Patients at a Geriatric Hospital

Shalom, Erez, Goldstein, Ayelet, Wais, Roni, Slivanova, Maya, Cohen, Nogah Melamed, Shahar, Yuval

arXiv.org Artificial Intelligence

Background: The use of a clinical decision support system for assessing the quality of care, based on computerized clinical guidelines (GLs), is likely to improve care, reduce costs, save time, and enhance the staff's capabilities. Objectives: To implement and evaluate a system for assessing the quality of care in the domain of pressure-ulcer management, by investigating the staff's level of compliance with the GLs. Methods: Using data for 100 random patients from the local EMR system, we performed a technical evaluation, checking applicability and usability, followed by a functional evaluation of the system, investigating the quality metrics assigned to the medical staff's compliance with the protocol. We compared the scores given by the nurse when supported by the system to the scores given by the nurse without the system's support, and to the scores given by the system. We also measured the time taken to perform the assessment with and without the system's support. Results: There were no significant differences in the scores of most measures given by the nurse using the system compared to the scores given by the system. There were also no significant differences in the values of most quality measures given by the nurse without support compared to those given by the nurse with support. Using the system, however, significantly reduced the nurse's average assessment time. Conclusions: Using an automated quality-assessment system may enable a senior nurse to quickly and accurately assess the quality of care. In addition to its accuracy, the system considerably reduces the time taken to assess the various quality measures.


Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence

Jorge, Emilio, Kågebäck, Mikael, Johansson, Fredrik D., Gustavsson, Emil

arXiv.org Artificial Intelligence

Acquiring your first language is an incredible feat and not easily duplicated. Learning to communicate using nothing but a few pictureless books, a corpus, would likely be impossible even for humans. Nevertheless, this is the dominant approach in most natural language processing today. As an alternative, we propose the use of situated interactions between agents as a driving force for communication, and the framework of Deep Recurrent Q-Networks for evolving a shared language grounded in the provided environment. We task the agents with interactive image search in the form of the game Guess Who?. The images from the game provide a non-trivial environment for the agents to discuss and a natural grounding for the concepts they decide to encode in their communication. Our experiments show not only that the agents learn to encode physical concepts in their words, i.e. grounding, but also that they learn to hold a multi-step dialogue, remembering the state of the dialogue from step to step.
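The learning dynamics this abstract describes can be miniaturized into a tabular referential game: two independent Q-learners, a sender that observes a "concept" and emits a symbol, and a receiver that sees only the symbol and guesses the concept, both rewarded on a correct guess. This sketch uses plain tabular Q-learning rather than the paper's Deep Recurrent Q-Networks, and all sizes and hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n_concepts, n_symbols = 4, 4
Q_send = np.zeros((n_concepts, n_symbols))  # sender: concept -> symbol values
Q_recv = np.zeros((n_symbols, n_concepts))  # receiver: symbol -> guess values
eps, lr = 0.1, 0.5                          # exploration rate, learning rate

for step in range(5000):
    c = rng.integers(n_concepts)            # concept shown to the sender
    s = rng.integers(n_symbols) if rng.random() < eps else int(Q_send[c].argmax())
    g = rng.integers(n_concepts) if rng.random() < eps else int(Q_recv[s].argmax())
    r = float(g == c)                       # shared reward for a correct guess
    Q_send[c, s] += lr * (r - Q_send[c, s])
    Q_recv[s, g] += lr * (r - Q_recv[s, g])

# Evaluate the greedy protocol: how often does communication succeed?
success = np.mean([Q_recv[int(Q_send[c].argmax())].argmax() == c
                   for c in range(n_concepts)])
print(f"greedy communication success: {success:.2f}")
```

Reward alone is enough pressure for a concept-to-symbol code to emerge, which is the grounding idea in miniature; running the sketch typically yields a high greedy success rate, though independent tabular learners can get stuck in partial codes. The paper's setting adds images, a multi-step dialogue, and recurrent networks on top of this basic loop.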